Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[feature](datalake) Add BucketShuffleJoin support for bucketed hive tables #27784

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

Nitin-Kashyap
Copy link
Contributor

@Nitin-Kashyap Nitin-Kashyap commented Nov 29, 2023

Add BucketShuffleJoin support for bucketed hive tables generated by Spark. (27783)

Proposed changes

Issue Number: close #27783

1. Original planner updated to consider BucketShuffle for bucketed hive table
2. Neerids planner updated for bucketShuffle join on hive tables.
3. Added spark style hash calculation in BE for shuffle on one side.

###Sample Output:s
NeredisPlanner
OldPlanner

image001

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

be/src/vec/columns/column_decimal.cpp Outdated Show resolved Hide resolved
be/src/vec/columns/column_map.cpp Show resolved Hide resolved
be/src/vec/columns/column_string.cpp Outdated Show resolved Hide resolved
Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@Nitin-Kashyap Nitin-Kashyap force-pushed the feature-hiveBucketShuffle branch from b4464d4 to f9e42ab Compare November 30, 2023 04:48
Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

be/src/vec/columns/column_vector.cpp Outdated Show resolved Hide resolved
@Nitin-Kashyap Nitin-Kashyap force-pushed the feature-hiveBucketShuffle branch 2 times, most recently from ed212e1 to eaf29b0 Compare November 30, 2023 05:47
Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@morningman morningman self-assigned this Nov 30, 2023
@morningman
Copy link
Contributor

Hi @Nitin-Kashyap , thanks for your contribution.
Could you please provide some create table stmt of hive table on spark side,
so that we can test this case?

@morningman
Copy link
Contributor

BTW, is it only suitable for "spark created" hive bucket table?
What if the hive table is created by other system with different hash function?

@Nitin-Kashyap
Copy link
Contributor Author

Nitin-Kashyap commented Dec 1, 2023

Hi @Nitin-Kashyap , thanks for your contribution. Could you please provide some create table stmt of hive table on spark side, so that we can test this case?

@morningman Please find the sample test I used for this case: -

CREATE TABLE parquet_test (
     user_id INT,
     key       VARCHAR(20),
     part      VARCAHAR(10)
)
USING parquet
PARTITIONED BY (part)
CLUSTERED BY (user_id) INTO 3 BUCKETS;

INSERT INTO parquet_test2 VALUES (31, 'U31', 'IN'),  (11,'U11','IN'), (21, 'U21', 'IN');

@Nitin-Kashyap
Copy link
Contributor Author

Nitin-Kashyap commented Dec 1, 2023

BTW, is it only suitable for "spark created" hive bucket table? What if the hive table is created by other system with different hash function?

@morningman Yes, for current scope it will understand only Spark created bucketed table, it identifies this by Properties defined by spark for bucket specification.

I plan to take up supporting for Hive, Hudi as well in some time (hopefully in next PR); for this I have left a place holder THashType [HIVE_MOD: Hive and Hudi use the same hash method] however for hudi some more changes on FE side need to do for identifing type bucket id from file path.

@Nitin-Kashyap Nitin-Kashyap force-pushed the feature-hiveBucketShuffle branch from eaf29b0 to 34c701c Compare December 2, 2023 12:19
Copy link
Contributor

github-actions bot commented Dec 2, 2023

clang-tidy review says "All clean, LGTM! 👍"

1 similar comment
Copy link
Contributor

github-actions bot commented Dec 2, 2023

clang-tidy review says "All clean, LGTM! 👍"

@Nitin-Kashyap Nitin-Kashyap force-pushed the feature-hiveBucketShuffle branch from 34c701c to d25350a Compare December 4, 2023 05:05
Copy link
Contributor

github-actions bot commented Dec 4, 2023

clang-tidy review says "All clean, LGTM! 👍"

be/src/vec/utils/util.hpp Outdated Show resolved Hide resolved
Copy link
Contributor

github-actions bot commented Dec 4, 2023

clang-tidy review says "All clean, LGTM! 👍"

Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@Nitin-Kashyap Nitin-Kashyap force-pushed the feature-hiveBucketShuffle branch from 10db37d to 7784db9 Compare December 13, 2024 08:52
Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@Nitin-Kashyap Nitin-Kashyap force-pushed the feature-hiveBucketShuffle branch from 7784db9 to 3780f23 Compare December 31, 2024 14:53
@924060929
Copy link
Contributor

You should support the enable_fallback_to_original_planner=true in master branch

@Nitin-Kashyap Nitin-Kashyap force-pushed the feature-hiveBucketShuffle branch 3 times, most recently from 8438d6e to 6a9f9b6 Compare February 3, 2025 11:07
@Nitin-Kashyap
Copy link
Contributor Author

run buildall

@doris-robot
Copy link

TeamCity be ut coverage result:
Function Coverage: 42.02% (10992/26158)
Line Coverage: 32.30% (92801/287271)
Region Coverage: 31.45% (47584/151300)
Branch Coverage: 27.49% (24089/87632)
Coverage Report: http://coverage.selectdb-in.cc/coverage/6a9f9b65f06e596819d0f4a0c2735a3f72dac9ad_6a9f9b65f06e596819d0f4a0c2735a3f72dac9ad/report/index.html

@Nitin-Kashyap Nitin-Kashyap force-pushed the feature-hiveBucketShuffle branch from 6a9f9b6 to d68ca64 Compare February 3, 2025 16:13
@Nitin-Kashyap
Copy link
Contributor Author

run build all

@Nitin-Kashyap Nitin-Kashyap force-pushed the feature-hiveBucketShuffle branch from d68ca64 to f5af994 Compare February 3, 2025 16:17
@Nitin-Kashyap
Copy link
Contributor Author

run buildall

@doris-robot
Copy link

TPC-H: Total hot run time: 32237 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit f5af9949369589b49af19a7c592b281a686e870d, data reload: false

------ Round 1 ----------------------------------
q1	17845	5596	5417	5417
q2	2066	302	181	181
q3	10712	1236	725	725
q4	10315	963	533	533
q5	9028	2382	2162	2162
q6	202	163	130	130
q7	902	752	593	593
q8	9234	1332	1176	1176
q9	5150	4906	4892	4892
q10	6821	2335	1890	1890
q11	454	279	256	256
q12	343	356	209	209
q13	17761	3715	3088	3088
q14	230	234	217	217
q15	515	485	460	460
q16	614	628	577	577
q17	549	858	314	314
q18	7028	6462	6419	6419
q19	2069	944	524	524
q20	300	313	183	183
q21	2800	2146	1985	1985
q22	370	332	306	306
Total cold run time: 105308 ms
Total hot run time: 32237 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5605	5486	5511	5486
q2	234	322	233	233
q3	2244	2716	2320	2320
q4	1356	1821	1373	1373
q5	4282	4747	4723	4723
q6	172	164	127	127
q7	2096	1967	1773	1773
q8	2615	2884	2751	2751
q9	7325	7224	7338	7224
q10	2988	3297	2810	2810
q11	572	510	493	493
q12	662	730	569	569
q13	3567	4007	3344	3344
q14	284	290	263	263
q15	513	481	462	462
q16	665	669	645	645
q17	1235	1726	1269	1269
q18	7610	7724	7332	7332
q19	796	1119	1059	1059
q20	2021	2037	1927	1927
q21	5707	5047	4957	4957
q22	605	607	581	581
Total cold run time: 53154 ms
Total hot run time: 51721 ms

@doris-robot
Copy link

TPC-DS: Total hot run time: 191679 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit f5af9949369589b49af19a7c592b281a686e870d, data reload: false

query1	1360	962	959	959
query2	6216	1968	1971	1968
query3	10987	4401	4528	4401
query4	61077	29276	23126	23126
query5	5499	616	455	455
query6	442	216	189	189
query7	5539	521	306	306
query8	334	255	232	232
query9	8155	2658	2649	2649
query10	451	313	283	283
query11	17553	15020	15621	15020
query12	166	125	109	109
query13	1413	543	395	395
query14	10954	7372	6802	6802
query15	214	196	196	196
query16	6861	644	469	469
query17	1088	716	565	565
query18	1473	384	322	322
query19	202	186	168	168
query20	123	122	111	111
query21	212	126	103	103
query22	4674	4491	4619	4491
query23	34219	33375	33893	33375
query24	5534	2299	2312	2299
query25	448	449	391	391
query26	647	275	153	153
query27	1590	492	332	332
query28	4374	2530	2525	2525
query29	531	554	436	436
query30	212	187	162	162
query31	961	881	811	811
query32	75	59	60	59
query33	411	364	296	296
query34	731	854	524	524
query35	813	837	774	774
query36	1005	1089	941	941
query37	117	104	79	79
query38	4297	4366	4403	4366
query39	1545	1441	1442	1441
query40	201	113	103	103
query41	54	62	54	54
query42	131	104	102	102
query43	530	519	499	499
query44	1338	857	869	857
query45	195	179	164	164
query46	869	1058	647	647
query47	1913	1894	1822	1822
query48	429	410	333	333
query49	713	485	395	395
query50	654	677	400	400
query51	4308	4247	4240	4240
query52	113	103	95	95
query53	234	252	191	191
query54	489	510	452	452
query55	82	80	82	80
query56	277	255	244	244
query57	1168	1214	1186	1186
query58	268	237	235	235
query59	3138	3321	3148	3148
query60	288	273	262	262
query61	117	113	124	113
query62	722	725	682	682
query63	229	188	186	186
query64	1255	1130	763	763
query65	3262	3149	3179	3149
query66	695	425	319	319
query67	16142	15836	15513	15513
query68	5087	826	523	523
query69	479	287	254	254
query70	1221	1154	1140	1140
query71	420	290	262	262
query72	6032	3877	3879	3877
query73	803	752	367	367
query74	10154	9185	8719	8719
query75	3193	3161	2650	2650
query76	3758	1177	765	765
query77	496	395	274	274
query78	10023	9930	9368	9368
query79	3470	801	602	602
query80	783	519	447	447
query81	515	279	241	241
query82	1134	152	127	127
query83	160	172	149	149
query84	294	93	77	77
query85	747	362	292	292
query86	385	305	308	305
query87	4480	4635	4355	4355
query88	4735	2203	2168	2168
query89	409	330	291	291
query90	1602	188	191	188
query91	129	135	107	107
query92	64	97	52	52
query93	2954	845	534	534
query94	728	393	293	293
query95	328	274	257	257
query96	495	624	293	293
query97	2845	2843	2750	2750
query98	219	202	190	190
query99	1636	1338	1245	1245
Total cold run time: 312095 ms
Total hot run time: 191679 ms

@doris-robot
Copy link

ClickBench: Total hot run time: 30.69 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit f5af9949369589b49af19a7c592b281a686e870d, data reload: false

query1	0.03	0.04	0.03
query2	0.08	0.04	0.04
query3	0.23	0.06	0.05
query4	1.66	0.08	0.08
query5	0.45	0.41	0.39
query6	1.16	0.66	0.65
query7	0.02	0.02	0.02
query8	0.05	0.05	0.05
query9	0.55	0.49	0.50
query10	0.56	0.56	0.56
query11	0.17	0.11	0.12
query12	0.15	0.13	0.13
query13	0.60	0.60	0.59
query14	2.74	2.74	2.75
query15	0.91	0.85	0.83
query16	0.39	0.37	0.38
query17	1.06	1.00	1.08
query18	0.18	0.18	0.19
query19	1.94	1.77	1.95
query20	0.02	0.02	0.01
query21	15.36	0.96	0.65
query22	0.76	0.78	0.69
query23	15.04	1.52	0.67
query24	2.19	0.36	0.22
query25	0.14	0.09	0.09
query26	0.28	0.19	0.18
query27	0.08	0.08	0.08
query28	13.41	1.25	0.54
query29	12.65	4.10	3.44
query30	0.24	0.08	0.05
query31	2.85	0.62	0.38
query32	3.22	0.56	0.48
query33	3.00	3.00	3.00
query34	16.43	5.14	4.52
query35	4.60	4.60	4.54
query36	0.61	0.50	0.47
query37	0.19	0.16	0.16
query38	0.16	0.16	0.14
query39	0.05	0.04	0.04
query40	0.16	0.14	0.13
query41	0.10	0.06	0.05
query42	0.06	0.04	0.05
query43	0.05	0.05	0.04
Total cold run time: 104.58 s
Total hot run time: 30.69 s

Nitin-Kashyap and others added 2 commits February 6, 2025 21:05
… generated by Spark. (27783)

    1. Original planner updated to consider BucketShuffle for bucketed hive table
    2. Neerids planner updated for bucketShuffle join on hive tables.
    3. Added spark style hash calculation in BE for shuffle on one side.
    4. Added shuffle hash selection based on left(non-shuffling) side.
@Nitin-Kashyap Nitin-Kashyap force-pushed the feature-hiveBucketShuffle branch 2 times, most recently from c18dab8 to 9bc4e8b Compare February 6, 2025 17:11
@Nitin-Kashyap Nitin-Kashyap force-pushed the feature-hiveBucketShuffle branch from 9bc4e8b to 1845b56 Compare February 6, 2025 17:15
@Nitin-Kashyap
Copy link
Contributor Author

run buildall

@doris-robot
Copy link

TPC-H: Total hot run time: 31798 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 1845b561a09baaa68cf55c1a70c7216c9dad02ec, data reload: false

------ Round 1 ----------------------------------
q1	17664	5266	5107	5107
q2	2060	302	165	165
q3	10647	1277	736	736
q4	10238	1007	546	546
q5	7680	2364	2402	2364
q6	194	167	135	135
q7	910	753	592	592
q8	9291	1249	1111	1111
q9	4850	4847	4790	4790
q10	6826	2336	1897	1897
q11	500	274	261	261
q12	347	361	227	227
q13	17777	3684	3100	3100
q14	223	220	203	203
q15	528	474	463	463
q16	633	623	571	571
q17	567	878	354	354
q18	6620	6300	6205	6205
q19	1463	974	549	549
q20	317	323	198	198
q21	2903	2163	1917	1917
q22	384	332	307	307
Total cold run time: 102622 ms
Total hot run time: 31798 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5213	5108	5209	5108
q2	234	328	227	227
q3	2192	2702	2323	2323
q4	1483	1817	1373	1373
q5	4221	4148	4150	4148
q6	211	167	126	126
q7	1867	1819	1749	1749
q8	2620	2605	2619	2605
q9	7294	7258	7163	7163
q10	3035	3196	2816	2816
q11	594	516	489	489
q12	697	808	632	632
q13	3373	3894	3323	3323
q14	277	278	267	267
q15	511	465	464	464
q16	647	685	648	648
q17	1162	1600	1343	1343
q18	7669	7422	7259	7259
q19	831	883	1052	883
q20	1974	2027	1881	1881
q21	5504	4977	4972	4972
q22	632	570	536	536
Total cold run time: 52241 ms
Total hot run time: 50335 ms

@doris-robot
Copy link

TPC-DS: Total hot run time: 190190 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 1845b561a09baaa68cf55c1a70c7216c9dad02ec, data reload: false

query1	1353	963	964	963
query2	6234	1838	1840	1838
query3	11120	4579	4560	4560
query4	56115	26208	23074	23074
query5	5120	541	494	494
query6	344	194	170	170
query7	4986	512	298	298
query8	319	226	221	221
query9	6191	2545	2559	2545
query10	397	320	265	265
query11	15149	15408	14843	14843
query12	159	109	108	108
query13	1078	501	392	392
query14	10343	6445	6955	6445
query15	214	209	180	180
query16	7137	648	504	504
query17	1069	704	571	571
query18	1518	467	318	318
query19	198	191	162	162
query20	128	126	119	119
query21	209	126	112	112
query22	4564	4615	4226	4226
query23	33982	33535	33444	33444
query24	5718	2445	2472	2445
query25	475	485	416	416
query26	701	273	166	166
query27	1844	491	356	356
query28	2946	2449	2442	2442
query29	598	570	439	439
query30	215	191	162	162
query31	880	864	850	850
query32	74	61	70	61
query33	454	352	310	310
query34	823	855	487	487
query35	802	829	735	735
query36	945	1026	915	915
query37	127	100	85	85
query38	4344	4213	4304	4213
query39	1516	1477	1433	1433
query40	219	114	101	101
query41	52	50	49	49
query42	121	112	104	104
query43	510	517	466	466
query44	1352	815	815	815
query45	190	178	172	172
query46	911	1075	664	664
query47	1851	1880	1789	1789
query48	388	424	318	318
query49	682	512	436	436
query50	723	755	449	449
query51	4246	4357	4213	4213
query52	107	108	104	104
query53	248	285	191	191
query54	476	498	433	433
query55	87	86	82	82
query56	285	282	274	274
query57	1180	1189	1120	1120
query58	277	251	249	249
query59	2798	2847	2768	2768
query60	310	307	300	300
query61	143	138	139	138
query62	736	755	707	707
query63	232	197	198	197
query64	1621	1034	730	730
query65	3312	3315	3191	3191
query66	780	393	296	296
query67	15879	15772	15322	15322
query68	5608	768	510	510
query69	538	304	267	267
query70	1207	1097	1034	1034
query71	459	292	262	262
query72	6053	3684	3805	3684
query73	1244	738	353	353
query74	8969	9355	9070	9070
query75	3349	3153	2703	2703
query76	3946	1181	747	747
query77	555	385	276	276
query78	10137	10146	9271	9271
query79	1950	794	594	594
query80	672	526	443	443
query81	491	281	241	241
query82	223	155	122	122
query83	171	167	152	152
query84	297	90	72	72
query85	726	396	301	301
query86	334	319	308	308
query87	4498	4595	4357	4357
query88	3053	2207	2179	2179
query89	403	323	284	284
query90	1838	190	194	190
query91	135	138	107	107
query92	76	59	56	56
query93	2151	995	573	573
query94	653	396	294	294
query95	344	273	262	262
query96	481	557	267	267
query97	2774	2808	2732	2732
query98	235	202	200	200
query99	1337	1414	1234	1234
Total cold run time: 295012 ms
Total hot run time: 190190 ms

@doris-robot
Copy link

ClickBench: Total hot run time: 30.98 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 1845b561a09baaa68cf55c1a70c7216c9dad02ec, data reload: false

query1	0.03	0.03	0.04
query2	0.10	0.04	0.04
query3	0.28	0.05	0.06
query4	1.60	0.08	0.07
query5	0.42	0.39	0.40
query6	1.16	0.64	0.66
query7	0.02	0.02	0.02
query8	0.06	0.04	0.06
query9	0.63	0.51	0.53
query10	0.57	0.57	0.56
query11	0.24	0.12	0.13
query12	0.24	0.12	0.12
query13	0.63	0.61	0.61
query14	2.79	2.73	2.81
query15	0.98	0.88	0.87
query16	0.37	0.38	0.37
query17	1.05	1.02	1.04
query18	0.17	0.17	0.19
query19	1.96	1.83	2.05
query20	0.01	0.02	0.01
query21	15.39	0.96	0.65
query22	0.93	1.01	0.78
query23	14.76	1.56	0.75
query24	5.25	0.62	0.29
query25	0.15	0.09	0.09
query26	0.56	0.22	0.18
query27	0.09	0.08	0.08
query28	11.02	1.16	0.56
query29	12.53	4.09	3.32
query30	0.27	0.08	0.06
query31	2.84	0.62	0.42
query32	3.23	0.58	0.50
query33	3.02	3.06	3.04
query34	16.35	5.15	4.48
query35	4.54	4.52	4.46
query36	0.62	0.50	0.48
query37	0.20	0.16	0.16
query38	0.17	0.15	0.16
query39	0.04	0.04	0.04
query40	0.18	0.16	0.16
query41	0.10	0.04	0.05
query42	0.07	0.05	0.05
query43	0.05	0.04	0.04
Total cold run time: 105.67 s
Total hot run time: 30.98 s

@doris-robot
Copy link

TeamCity be ut coverage result:
Function Coverage: 42.04% (11020/26214)
Line Coverage: 32.33% (93056/287845)
Region Coverage: 31.48% (47711/151556)
Branch Coverage: 27.50% (24137/87782)
Coverage Report: http://coverage.selectdb-in.cc/coverage/1845b561a09baaa68cf55c1a70c7216c9dad02ec_1845b561a09baaa68cf55c1a70c7216c9dad02ec/report/index.html

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature] Enable BucketShuffle Join for Hive tables
7 participants